Running Head: RETRIEVAL, MONITORING AND REPORT BIAS
Author
Abstract
Performance on tests where there is control over reporting (e.g., cued recall with the option to withhold responses) can be characterized by four parameters: free- and forced-report retrieval (correct responses retrieved from memory when the option to withhold responses is exercised and when it is not, respectively), monitoring (discrimination between correct and incorrect potential responses) and report bias (willingness to report responses). Typically, researchers do not examine all these components of cued test performance; blanks are sometimes counted the same as errors, meaning that the (free-report) performance index is contaminated with report bias and monitoring ability. In this research, a two-stage testing procedure is described that allows measures of free- and forced-report retrieval, monitoring and bias to be derived from the original encoding specificity experiments (Thomson & Tulving, 1970). The results show that their cue reinstatement manipulation affects free-report retrieval, but once report bias and monitoring effects are removed by forcing output, retrieval is unaffected.

Strong Cues are not Necessarily Weak: Thomson & Tulving (1970) and the Encoding Specificity Principle Revisited

The accuracy of memory reports can be improved by the exertion of some control over how much and what is reported. For example, people who choose only to report memories about which they are very confident are likely to have more accurate memory reports than people who report everything that comes to mind (e.g., see Koriat & Goldsmith, 1996a). Although the effect of report option on memory accuracy is well known, very few analytical tools are currently available to memory researchers to estimate or otherwise control report bias. Within the context of free recall, Roediger and Payne (1985) have stated, "it is rather puzzling that the problem of recall criteria [report option] has received so little attention, since some of our most popular theories make strong assumptions about their role, as do commonsense folk theories" (p. 7).

In laboratory research, where participants are required to give responses to a finite set of cues (e.g., questions, words, pictures, word stems and so on), report option has generally been considered a nuisance factor. Researchers typically try to control or eliminate any influence of report option by, for example, forcing participants to respond to every test cue, guessing if necessary. However, not all researchers force responses to all cues, and in some cases this allows metacognitive and decision-making parameters to distort the memory index. Indeed, the experiment described in this paper, which revisits the effect of context reinstatement on memory, demonstrates that previous failures to control report option in a cued testing paradigm may have led to some misleading interpretations of experimental data.

The problem of dealing with the effect of report option on memory performance is just as serious in many everyday memory situations, where memory correspondence and accuracy are emphasized. For example, consider the literature on the cognitive interview and its effect on memory (e.g., Fisher & Geiselman, 1992; see Memon & Higham, 1999 for a recent review). Developers of the cognitive interview have incorporated techniques into the interview that have been shown to enhance memory performance in laboratory studies (e.g., mental reinstatement of context).
The hope is that these techniques will increase the likelihood that interviewees will be able to remember details and/or events that otherwise would not have been recalled during the interview process. In other words, there is an assumption that these techniques actually enhance the accessibility of memory traces. However, some concern has been raised lately over the possibility that the cognitive interview merely affects the report criterion (i.e., renders it more liberal relative to a standard interview) and has no effect on the accessibility of memory traces (e.g., see Fisher, 1996; Higham & Roberts, 1996; Memon & Higham, 1999; Memon & Stevenage, 1996; and Roberts & Higham, 2001 for discussion). Similar claims have been made regarding the effect of hypnosis on memory performance (e.g., Dywan & Bowers, 1983; Klatzky & Erdelyi, 1985); that is, hypnosis serves merely to increase the amount of information provided in the memory report, not to make memory traces more accessible. Despite all this concern, however, there is no well-accepted analytical tool or method that can be used to gain an estimate of report bias in situations where reporting is optional. One purpose of this paper is to introduce a methodology that is simple to apply to cued testing procedures and which allows various memory parameters, including report bias, to be determined. Each parameter reflects a distinct aspect of memory performance that is qualitatively and quantitatively separable from the others.

Koriat and Goldsmith (1996a; see also 1994, 1996b, 1996c and Koriat, Goldsmith & Pansky, 2000) also gained an estimate of participants' report criterion when they were given the opportunity to withhold responses. To do so, they gave participants a set of general knowledge questions and had them provide answers to the questions under conditions of either forced-report or free-report. For forced-report, participants had to provide an answer to every test question, guessing if necessary. For free-report, participants answered questions with an accuracy incentive; they were penalized (by having money subtracted from their total) for incorrect answers and rewarded (with money added to their total) for correct answers. However, they could avoid being penalized (and being rewarded) by leaving questions blank, in which case no money was added to or subtracted from their total.

It is assumed in Koriat and Goldsmith's model that an input question causes a "best-candidate" answer to be retrieved from long-term memory and considered for report. A monitoring mechanism estimates the probability that the best candidate is the correct answer (the assessed probability, Pa) and assigns it to the candidate. In the free-report condition, the candidate is reported if its associated Pa is greater than or equal to some predetermined report criterion probability (Prc); otherwise, it is withheld. Prc reflects the report criterion, and can be estimated with their model by comparing performance under free- versus forced-report conditions. However, their model also allows for other memory parameters to be determined. For example, monitoring effectiveness is "the extent to which the assessed probabilities (Pa) successfully differentiate correct from incorrect candidate answers" (Koriat & Goldsmith, 1996a, p. 494). Also, retrieval is the percentage of input questions that are answered correctly under forced-report conditions.
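To make the report decision concrete, the following sketch renders it in Python. It is an illustration only: the function names, the particular payoff values and the way Pa is supplied are my own choices for exposition, not part of Koriat and Goldsmith's formal treatment.

```python
# A schematic rendering of the free-report decision described above:
# the best candidate is volunteered only if its assessed probability (Pa)
# reaches the report criterion probability (Prc). Payoff values are
# illustrative (e.g., +1 for a correct report, -1 for an incorrect one).

def free_report_decision(pa: float, prc: float) -> bool:
    """Return True if the candidate is volunteered, False if withheld."""
    return pa >= prc

def payoff(volunteered: bool, correct: bool,
           reward: float = 1.0, penalty: float = -1.0) -> float:
    """Points earned on a single free-report trial."""
    if not volunteered:          # blanks earn and cost nothing
        return 0.0
    return reward if correct else penalty

# Example: a candidate assessed at Pa = .60 is withheld under a strict
# criterion (Prc = .75) but reported under a lax one (Prc = .50).
for prc in (.75, .50):
    print(prc, free_report_decision(0.60, prc))
```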
The model and methodology advocated in this paper also allow for the estimation of monitoring and retrieval in addition to report bias, only this is achieved using a model analogous to that underlying signal detection theory (SDT). Although Koriat and Goldsmith's (1996a) model is similar in spirit to an SDT model in that there is an attempt to separate response effects from memory effects, they argue that an SDT model is unsuitable for gaining separate estimates of bias, monitoring and retrieval:

Thus in the forced-report "old/new" paradigm to which signal detection methods are typically applied, control is isolated in terms of the parameter β, yet retention (overall memory strength) and monitoring effectiveness (the extent to which a person's confidence distinguishes old-studied from new-foil items) cannot be operationally or conceptually separated: Both are equally valid interpretations of d' (Koriat & Goldsmith, 1996a, p. 506).

Koriat and Goldsmith's arguments are true for the typical application of SDT to standard yes/no testing paradigms (e.g., old/new recognition); it is impossible to obtain separate estimates of bias, monitoring and retrieval with standard SDT because there are only two parameters (β and d'). However, because report option is generally not an issue with yes/no recognition research (participants typically respond "yes" or "no" to all test items), there is no need to gain these separate estimates. The more relevant question is whether or not these separate estimates can be obtained using an SDT analogue model that can be applied to paradigms where a report option is exercised. My research addresses exactly this question, and it suggests that such an application is not only feasible, but has the potential to be very informative about memory and the decision processes involved with tests containing a finite set of cues.

The SDT Model and Indices of Performance

The psychological model that is adopted in the present research is essentially the same as the one presented in Koriat and Goldsmith (1996a), but the methodology, performance indices and methods of data analysis are quite different. In response to an input cue (word stem/fragment, word associate, general knowledge question, and so on), it is assumed that a "best-candidate" answer is retrieved from long-term memory and considered for report. The best candidate answer is the one to which the monitoring mechanism assigns the highest confidence-of-accuracy. As with Koriat and Goldsmith's model, no claims are made regarding the nature of retrieval (e.g., whether the candidate is retrieved from episodic or semantic memory); only the accuracy of the retrieval and the decision processes that follow retrieval of the best candidate are of interest. Participants are assumed to set a report criterion along the confidence dimension such that any items with accuracy confidence above the criterion are reported, whereas items with accuracy confidence below the criterion are withheld. Setting of the report criterion is assumed to depend on situational demands and accuracy incentives, just as with Koriat and Goldsmith's model.

Both Koriat and Goldsmith's (1996a) approach and the SDT approach require information regarding the accuracy of both reported candidates and initially withheld candidates. In other words, the frequencies are needed for all four cells of the 2 (accurate/inaccurate candidate) X 2 (report/withhold response) contingency table shown as Table 1.
The a cell in Table 1 represents the number of responses that are correct and reported in free-report; b is the number of incorrect/reported responses; c is the number of initially withheld responses that are correct at forced-report; and d is the number of initially withheld/incorrect responses at forced-report. To obtain these frequencies, participants' performance was evaluated under free-report and forced-report conditions, just as with Koriat and Goldsmith's (1996a) methodology. For free-report, participants were tested with a finite number of cues that required responses, but an incentive for leaving responses blank was provided with a pay-off schedule; a reward was attached to correct reported answers, a penalty was attached to incorrect reported answers, but no penalty was attached to blanks. The number of correct and incorrect responses made during this free-report phase of the experiment provided data for cells a and b, respectively, in Table 1. In forced-report, participants were persuaded to provide responses to all test cues, guessing if necessary. The number of new correct and incorrect initially withheld answers revealed under these forced-report circumstances provided data for cells c and d, respectively, in Table 1.

Following SDT, in the present research, correct and incorrect candidates are assumed to form two overlapping distributions over a dimension of subjective confidence-of-accuracy, with correct candidates having higher confidence on average than incorrect candidates. The assignment of a reasonably placed report criterion on the confidence dimension splits the correct and incorrect candidate distributions into two, rendering four areas corresponding to hits, false alarms, misses and correct rejections, as in standard SDT. Once the hit and false alarm rates are derived from the a, b, c and d frequencies (see Table 1 for the formulae), various memory parameters can be derived. The index chosen for monitoring was A', a nonparametric measure of discrimination (e.g., Grier, 1971). Essentially, this measure indicates the degree to which participants have the tendency to report correct candidates and withhold incorrect ones. The index chosen for report bias was B″D (Donaldson, 1992), a nonparametric measure of bias. This index indicates participants' tendency to report candidates, regardless of their accuracy. Other measures of discrimination (monitoring) and (report) bias might have been chosen, but nonparametric measures avoid the strong distributional assumptions that would be necessary to use, say, d' and β. The performance indices chosen here for monitoring and report bias differ from Koriat and Goldsmith's (1996a) in that they are based on the actual reporting behavior of participants, not on their confidence ratings. I will argue that such measures are more direct indices of participants' performance and that they are not subject to problems associated with the analysis and interpretation of confidence ratings. These problems, and the implications of the difference between the two types of performance index, are discussed in more detail below.

In addition to the SDT measures of monitoring and bias, two types of retrieval were also calculated. The first, free-report retrieval, refers to the number of correct responses on the cued memory test, divided by the number of cues on the test, when participants are permitted to leave cues blank.
The second, forced-report retrieval, refers to the number of correct responses on the cued memory test, divided by the number of cues on the test, when participants are forced to give a response to every cue. The formulae based on the a, b, c and d frequencies that were used to derive the retrieval measures are also shown in Table 1.

Overview of the Experiment

Whenever participants exercise a report option, they potentially fail to report correct candidates that have been retrieved, but this cannot occur under forced-report conditions. Thus, free-report retrieval is generally an underestimate of forced-report retrieval. The SDT model assumes that the degree of this underestimation depends on at least two factors: 1) The placement of the report criterion. The more conservative the placement of the report criterion, the more free-report retrieval will underestimate forced-report retrieval (i.e., c will be high; see Table 1). 2) The level of monitoring. The less effective the monitoring, the more free-report retrieval will underestimate forced-report retrieval. If participants have perfect monitoring and report only correct answers at their report criterion (free-report), then no new correct responses will be revealed when reporting is forced, so retrieval will be unaffected. However, if monitoring is less than perfect, then some correct answers are likely to be withheld at the report criterion. When these correct answers are revealed in forced-report, retrieval will be increased relative to free-report retrieval.

These two assumptions were used to clarify the interpretation of Thomson and Tulving's (1970) results demonstrating the encoding specificity principle. In a study phase, participants were given targets to memorize that were paired with weak associates (e.g., bats-BLOOD, where BLOOD was the target and bats was the weak associate). At test, participants were asked to recall target words under three different conditions. In the weak cue condition, the same weak cue that was presented with the target at study was presented again and participants were asked to recall the associated target (e.g., bats-?). In the strong cue condition, a new strong associate of the target was presented and participants were asked to recall the related target word (e.g., donor-?). Finally, as a baseline, participants in the no cue condition were simply asked to recall as many of the targets as possible without the assistance of any cues.

In Thomson and Tulving's (1970) original version of this experiment, they found that retrieval in the weak cue condition was much better than retrieval in the strong cue condition. In fact, the strong cues did not seem to augment retrieval at all; for the most part, performance to strong cues was no better than performance to no cues. In their abstract, they stated that after participants studied weak associate-target pairs, "Recall of the TBR [to-be-remembered] words in the presence of these [weak] cues was greatly facilitated in comparison to noncued recall; recall of the TBR words in the presence of their strongest normative associates, which had not been seen at input, did not differ from noncued recall" (p. 255). However, Thomson and Tulving (1970) did not force output to all test cues, rendering their measure of recall analogous to free-report retrieval, which is potentially contaminated with monitoring and report bias.
To investigate whether Thomson and Tulving's results and conclusions were affected by these variables, I will apply the SDT analogue model outlined above and determine the effect of cue type (weak versus strong) on retrieval, report bias and monitoring. Also, by comparing free- and forced-report retrieval, the contaminating effect of report bias on free-report retrieval can be assessed.

Method

Participants

Forty-eight undergraduates participated in the experiment either for course credit or financial compensation. Thirty-two participants were assigned to two cued-recall groups: 16 to the phase group and 16 to the trial group. Sixteen participants were assigned to the no cue group. Participants were tested in groups of 1-4 at individual work stations. All participants in a given session were assigned to the same experimental group.

Design & Materials

One hundred target words, each with two associated cue words, were chosen from the Edinburgh Associative Thesaurus. One cue word for each target word was a strong associate of the target (mean probability of target production for all 100 strong cues = 35%), whereas the second was a weak associate (mean probability of target production for all 100 weak cues = 1%). No word was repeated across the lists of weak cues, strong cues or targets. The 100 targets, 100 strong cues and 100 weak cues are presented in Appendix A.

The experiment was divided into two or three phases, depending on the particular experimental group. In phase 1, all 100 targets, along with their weak associates, were presented in pairs in a random order to all participants for study. For counterbalancing purposes in the cued-recall (trial and phase) groups, half the participants were presented in phase 2 with strong cues to elicit targets 1-50 and with weak cues to elicit targets 51-100, whereas this was reversed for the other half of the participants. This counterbalancing procedure was not necessary for the no cue group because no associates were presented to participants in phase 2.

In phase 2, participants in the trial and phase groups were presented with retrieval cues. Participants were initially given the choice of providing a response to a given cue or leaving it blank (free-report; see Procedure below). However, responses to cues that were left blank were obtained later in the experiment using two forced-report methods. For the phase group, cues that were left blank were presented to participants again in a third phase of the experiment, which commenced immediately after all 100 cues were presented in phase 2. In the trial group, immediately after a cue was left blank, the cue was presented again and a response was required before moving on to the next trial. Thus, there were two experimental phases in the trial group, but three phases in the phase group. The order of presentation of the cues in phase 2 was randomized for each participant. After randomization (which was different for each participant), data from the first six trials of phase 2 in both the trial and phase groups were counted as practice trials and dropped from the analysis. Thus, analyses in these groups were based on 94 items, with an average of 47 (range: 44-50) data points for each of the strong and weak conditions. For the no cue group, immediately following the study phase, participants were asked to recall as many upper case words from phase 1 as possible, without the assistance of any cues.
Thus, there were two experimental phases in the no cue condition. Because there were no cues presented to participants in the no cue group, the effect of the strong/weak manipulation could not be assessed, nor could the effect of free- versus forced-report. The purpose of the no cue condition was merely to establish participants' unassisted free-report retrieval ability, given the particular study conditions, which were held constant across all three experimental groups.

Procedure

In phase 1, participants in all three experimental groups studied 100 weak cue-target word pairs presented individually for 3 s each, centered on a computer monitor. The cue words were displayed in lower case letters to the left of the target words, which were presented in upper case letters. Participants were instructed to study the upper case words for a later memory test, but to attend to the lower case words as possible cues to assist in recalling the upper case words at test. These instructions were given verbally by the experimenter as well as appearing on the computer monitor. The order of presentation of the word pairs was randomized.

In phase 2, participants in the phase group were instructed that they would be presented with words, one at a time, on the computer monitor. Each cue word was centered on the monitor and displayed with a question mark to its immediate right, indicating that a response was requested. Participants were told that each word presented during this phase had an upper case word from the training list that was related to it, and that they should use the word as a cue to assist in recalling the upper case word. Participants were also informed that each correct answer would earn 1 point and each incorrect answer would cost 4 points, but that they could avoid the point system by not entering a response to the cue; blank (no response) trials would neither earn nor cost any points. If participants chose to enter a response, they typed the response on the computer keyboard. If they chose not to respond, they entered "B" (for "blank"). If a response was entered, participants were then asked to give a confidence rating on a separate screen, using a scale from 1 to 6, where 1 = extremely low confidence correct, 2 = very low confidence correct, 3 = low confidence correct, 4 = high confidence correct, 5 = very high confidence correct, and 6 = extremely high confidence correct. This scale appeared on the screen whenever a confidence rating was required in any of the experimental groups. After entering a confidence rating, the cumulative points, the points earned/lost for that particular trial (either +1 or -4) and the number of trials remaining were displayed. After receiving feedback, participants pressed the space bar to initiate the next trial. If "B" was entered, no confidence rating was gathered, and the program skipped to the feedback display. The feedback display for blank trials again consisted of the cumulative points, the points for that trial (0) and the number of trials remaining. Participants in the phase group completed all one hundred trials before moving on to phase 3.

During phase 3, all cues that participants had chosen to leave blank in phase 2 were presented again, one at a time. Participants were informed that blanks were no longer a viable response option to the cues; a response was required even if it was necessary to guess, and the experimental program would not advance to the next stage until a response was entered ("B" was no longer accepted as a response).
Participants were also told, however, that responding in this phase did not affect point totals because the point system was no longer operative. Immediately after entering a response to each cue in phase 3, participants were asked for a confidence rating using the same 6-point confidence scale as in phase 2. No feedback screen was displayed during phase 3 because point totals were unaffected by the accuracy of the responses.

The procedure for participants in the trial group was similar to that for the phase group except in the method used to collect responses for cues initially left blank. Instead of gathering responses to cues initially left blank in a separate third phase of the experiment as in the phase group, responses in the trial group were gathered on a trial-by-trial basis. As in the phase group, immediately after participants entered a "B" to leave a trial blank, the feedback screen was displayed. However, instead of advancing to the next cue when the space bar was pressed, the same cue was presented again in the trial group and participants were required to enter a response, guessing if necessary. No points were at stake for the second presentation of the cue. As in the phase group, confidence ratings (1-6) were gathered for both initially reported and initially withheld responses.

Following the study phase, participants in the no cue group were asked to try to recall as many of the upper case words from phase 1 as possible. Participants entered their responses into the computer, typing each word next to a prompt (Response ->) which appeared in the center of the computer monitor. After each response, a confidence rating was gathered using the same 6-point scale as in the trial and phase groups. When participants could remember no more words, they were instructed to type DONE, which terminated the program.

Results & Discussion

Two 2X2 contingency tables, similar to that shown in Table 1, were created for each participant in the trial and phase groups. The first table corresponded to performance for strong cues and the second to performance for weak cues. Given that there were 94 valid test trials (100 in total minus 6 practice trials), the sum of the frequencies across both contingency tables for each participant was 94, with as many as 50 and as few as 44 observations for a given table. Once both contingency tables were created for each participant, it was possible to determine four performance measures from each table using the equations presented in Table 1: free- and forced-report retrieval, monitoring and report bias. Table 2 shows the frequencies for each table (weak-phase, strong-phase, weak-trial and strong-trial) averaged across participants. Before conducting the SDT analyses, the hit and false alarm rates were adjusted according to Snodgrass and Corwin's (1988) recommendation to avoid having to eliminate participants because of undefined values (i.e., .5 was added to the numerator and 1 was added to the denominator). However, for some non-SDT analyses that did not involve hit and false alarm rates, participants with undefined cells were eliminated from the analysis. The exact number of participants eliminated can be determined from the degrees of freedom presented with the analysis.
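To illustrate how the four measures can be obtained from a single contingency table, the sketch below computes them in Python. Because Table 1 is not reproduced here, the hit and false alarm rate definitions, together with the A' and B″D expressions (the standard Grier, 1971, and Donaldson, 1992, formulas), are my reconstruction from the cell descriptions given above; the example frequencies are hypothetical.

```python
# Sketch of the four indices derived from one 2x2 table of frequencies:
#   a = correct & reported, b = incorrect & reported,
#   c = correct & initially withheld, d = incorrect & initially withheld.
# Hit/false alarm rates use the Snodgrass and Corwin (1988) adjustment
# (.5 added to the numerator, 1 to the denominator) described in the text.

def indices(a, b, c, d):
    n = a + b + c + d
    free_retrieval = a / n              # correct responses volunteered
    forced_retrieval = (a + c) / n      # correct responses once output is forced
    hit = (a + 0.5) / (a + c + 1)       # P(report | correct candidate)
    fa = (b + 0.5) / (b + d + 1)        # P(report | incorrect candidate)

    # A' (Grier, 1971): nonparametric discrimination (monitoring).
    if hit >= fa:
        a_prime = 0.5 + ((hit - fa) * (1 + hit - fa)) / (4 * hit * (1 - fa))
    else:
        a_prime = 0.5 - ((fa - hit) * (1 + fa - hit)) / (4 * fa * (1 - hit))

    # B''D (Donaldson, 1992): nonparametric report bias
    # (positive = conservative, negative = liberal).
    b_double_d = (((1 - hit) * (1 - fa)) - (hit * fa)) / \
                 (((1 - hit) * (1 - fa)) + (hit * fa))

    return free_retrieval, forced_retrieval, a_prime, b_double_d

# Hypothetical strong cue table: few correct answers volunteered, many withheld.
print(indices(a=3, b=2, c=9, d=33))
```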
Retrieval

Free- and forced-report retrieval is shown in Figure 1. The method used to collect initially withheld responses (trial versus phase) had no effect on either the free- or forced-report retrieval measures, nor did it interact with any other variables, largest F(1,30)=1.27, so the means presented are collapsed across cued-recall group. Retrieval in the no cue condition is shown as the solid line in the figure. The pattern of results indicates that when participants exercised the option to withhold answers (free-report), their retrieval for strong cues was much worse (.07) than if reporting was forced (.26). However, retrieval for weak cues was unaffected by report option (free: .26; forced: .28). The pattern was confirmed by a 2 (group: trial/phase) X 2 (cue type: strong/weak) X 2 (report type: free/forced) mixed analysis of variance (ANOVA), which revealed a significant cue type (strong/weak) by report type (free/forced) interaction, F(1,30)=97.22, MSE=.002, p<.001. A post hoc Tukey honest significant difference (HSD) test on the interaction revealed that performance in the free-strong cell was significantly less than performance in any of the other three conditions (free-weak, forced-strong, forced-weak), all ps<.001. Performance between the free-weak, forced-strong and forced-weak cells did not differ, smallest p=.14.

To compare the obtained results to Thomson and Tulving's (1970), it is important to re-emphasize that they treated blank responses the same as incorrect responses; that is, they used free-report retrieval as their index of memory performance. Examining Figure 1, it is apparent that the results for free-report retrieval replicated their pattern exactly; weak cue performance was superior to no cue performance, but strong cue performance was not. In support of this pattern, a one-way ANOVA (group: trial/phase/no cue) comparing free-report retrieval in the weak cue conditions (trial and phase) to retrieval in the no cue condition revealed a significant main effect, F(1,45)=14.89, MSE=.013, p<.001. A post hoc Tukey HSD test revealed that free-report retrieval was significantly greater in the weak cue conditions (trial = .24; phase = .29) than in the no cue condition (.08), both ps<.001, and that the weak cue conditions did not differ from each other, p=.43. An analogous one-way ANOVA comparing free-report retrieval in the strong cue conditions with the no cue condition did not reveal a significant effect, F<1. Thus, in free-report, only weak cues elicited performance that was superior to the no cue group.

However, a very different pattern of results is revealed if the forced-report measure is examined in Figure 1, where report bias and monitoring could no longer contaminate performance. Now, both strong and weak cue retrieval was much better than no cue retrieval. This conclusion was supported by two one-way ANOVAs (group: trial/phase/no cue) on the forced-report data, one for strong cues and one for weak cues. Both ANOVAs revealed significant main effects, F(1,45)=31.97, MSE=.005, p<.001 and F(1,45)=16.75, MSE=.014, p<.001, for strong and weak cues, respectively. Post hoc Tukey HSD tests indicated that forced-report retrieval in both the strong cue conditions (trial = .25; phase = .27) and the weak cue conditions (trial = .26; phase = .31) was superior to no cue retrieval (.08), all ps<.001.
Forced-report retrieval did not differ between the strong cue conditions in the first analysis, p=.76, nor did it differ between the weak cue conditions in the second analysis, p=.51. Thus, when output was forced, both weak and strong cues facilitated retrieval relative to the no cue group.

Monitoring

The model assumes that free-report retrieval will underestimate forced-report retrieval more if monitoring is low than if it is high. Thus, the weak cue condition, where the underestimation of forced-report retrieval was small, should have higher monitoring than the strong cue condition, where the underestimation was large. Consistent with this assumption, monitoring was much better for weak cues (.88) than for strong cues (.61). The reliability of this difference was confirmed with a 2 (group: trial/phase) X 2 (cue type: strong/weak) ANOVA on the monitoring measure (A'), which revealed only a significant main effect of cue type, F(1,30)=100.50, MSE=.012, p<.001. Because monitoring was good in the weak cue condition, participants were in a position to set their report criterion such that the vast majority of correct answers were reported. Consequently, making the report criterion more liberal by forcing report did not elicit much new correct information, so retrieval was unaffected by the free/forced manipulation. Conversely, poor monitoring in the strong cue condition meant that many correct responses were withheld at the report criterion. Forcing responses, however, elicited these correct responses and caused retrieval to increase in the strong cue condition. Thus, strong cue retrieval was greatly affected by the free/forced manipulation.

Report Bias

The model also assumes that free-report retrieval will underestimate forced-report retrieval to a greater extent if the report criterion is conservative than if it is liberal. The monitoring difference between the strong and weak cue conditions described in the previous analysis might have been enough to account for the differential level of forced-report retrieval underestimation that was found. Nonetheless, the differential underestimation might also have been partially due to participants having a more conservative report criterion in the strong cue condition than in the weak cue condition. The results indicated that report bias, as measured by the SDT index (B″D), was indeed more conservative in the strong cue condition (.62) than in the weak cue condition (-.31). A 2 (group: trial/phase) X 2 (cue type: strong/weak) mixed ANOVA on B″D calculated for each participant revealed only a significant main effect of cue type, F(1,30)=85.93, MSE=.061, p<.001.

Some caution should be exerted when interpreting the cue type effect on B″D. Although the effect could be interpreted as a shift in the placement of the criterion between the two cue types, it is also possible that a single criterion was used for both classes of cues, but that mean confidence differed between them. If a single report criterion were placed between the means of the strong and weak cue confidence distributions, B″D would be conservative relative to the strong cue distribution, but liberal relative to the weak cue distribution. This interpretation of the difference in report bias seems more likely than one where participants are assumed to be adopting multiple criteria.
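A small simulation makes the single-criterion account concrete. The distributional assumptions and parameter values below are purely illustrative, not estimates from the present data; the point is simply that one fixed criterion applied to confidence distributions with lower means yields a more conservative B″D.

```python
# Single report criterion applied to confidence distributions whose means
# differ between cue types. Correct and incorrect candidates are assumed
# (illustratively) to be normally distributed over subjective confidence.
from scipy.stats import norm

def b_double_d(hit, fa):
    """Donaldson's (1992) B''D from hit and false alarm rates."""
    return (((1 - hit) * (1 - fa)) - (hit * fa)) / \
           (((1 - hit) * (1 - fa)) + (hit * fa))

criterion = 0.0  # one criterion used for every cue

# Illustrative means (sd = 1): weak cue candidates attract higher confidence
# overall; the correct-incorrect separation is the same for both cue types.
conditions = {"weak": (1.5, -0.5), "strong": (-0.5, -2.5)}

for cue, (mu_correct, mu_incorrect) in conditions.items():
    hit = norm.sf(criterion, loc=mu_correct)    # P(report | correct)
    fa = norm.sf(criterion, loc=mu_incorrect)   # P(report | incorrect)
    print(cue, round(b_double_d(hit, fa), 2))
# With the same criterion, B''D comes out liberal (negative) for weak cues
# and conservative (positive) for strong cues.
```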
Importantly, however, debates regarding the source of the bias effect are not relevant to the testing of the report bias predictions made by the SDT model; conservative placement means a high free-forced retrieval discrepancy regardless of whether there has been a "true" criterion shift or not. Support for a difference between mean confidence ratings in the strong and weak cue conditions was found from a 2 (group: phase/trial) X 2 (cue type: strong/weak) X 2 (accuracy: correct/incorrect) ANOVA on the mean confidence ratings. It revealed a main effect of cue type, F(1,30)=312.04, MSE=.253, p<.001; responses to weak cues were assigned higher confidence (3.41) than answers to strong cues (1.84). The main effect of accuracy was also significant, F(1,30)=525.01, MSE=.207, p<.001, as was the accuracy by cue type interaction, F(1,30)=374.81, MSE=.172, p<.001. Correct answers were assigned higher confidence (3.55) than incorrect answers (1.71), but this difference was greater for weak cues (+3.26) than for strong cues (+.42). This interaction is consistent with the monitoring analysis above that revealed higher monitoring in the weak cue condition than in the strong cue condition.

Other Measures of Monitoring and Report Bias

Nelson (1984) has advocated the Goodman-Kruskal gamma (γ) correlation as the most suitable accuracy measure of metamemory predictions (although see Schraw, 1995, and commentary by Nelson, 1996, and Wright, 1996). Similarly, Koriat and Goldsmith (1996a) used the γ correlation between confidence and accuracy as a measure of monitoring. So that the SDT measure (A') could be compared directly to γ, two γ coefficients were calculated for each participant in the trial and phase groups. The first, γ-rc, is the correlation calculated from the 2X2 contingency table presented in Table 1. The second, γ-conf, is the correlation between confidence ratings and accuracy as derived from 2 (accuracy: correct/incorrect) X 6 (confidence scale level: 1-6) contingency tables. Both can be considered measures of monitoring, only γ-rc, like A', is based on the reporting behavior of participants, whereas γ-conf is based on confidence ratings. Just as with the analysis on the SDT measure (A'), γ-rc and γ-conf were analyzed with separate 2 (group: trial/phase) X 2 (cue type: strong/weak) ANOVAs. Both ANOVAs revealed only a significant main effect of cue type, F(1,28)=39.93, MSE=.188, p<.001 and F(1,30)=63.87, MSE=.128, p<.001, for γ-rc and γ-conf, respectively. γ-rc and γ-conf were both higher in the weak cue condition (both Ms=.95) than in the strong cue condition (both Ms=.24). These results are consistent with the analysis of A'.

The SDT index of report bias (B″D) was compared with Koriat and Goldsmith's (1996a) index of bias (Prc). As before, the comparison was of interest because B″D is based on reporting behavior, whereas Prc is based on confidence ratings. To calculate Prc, first the number of reported and withheld candidates was determined for each level of confidence. Then five fit ratios (Koriat & Goldsmith, 1996a) were calculated for each participant, representing placement of the criterion at "2" through "6." Fit ratios were determined by averaging (weighted according to the number of observations available) (1) the proportion of withheld candidates assigned confidence below a given criterion and (2) the proportion of reported candidates assigned confidence equal to or higher than the criterion.
The scale value that yielded the highest fit ratio represented Prc for that participant. If more than one scale value was associated with the highest fit ratio, the mean of the values was taken as Prc. A 2 (group: trial/phase) X 2 (cue type: strong/weak) ANOVA revealed a significant main effect of cue type, F(1,28)=5.57, MSE=.596, p<.05; the mean Prc was more conservative in the strong cue condition (4.03) than in the weak cue condition (3.57). This result is consistent with the analysis on the SDT measure of report bias (B″D).

Although both the analysis of B″D and that of Prc revealed significant effects of cue type, the effect sizes were considerably different (η=.74 and .17 for B″D and Prc, respectively). This difference may actually reflect dissimilarity in the nature of the information that each index is measuring. Although B″D is sensitive to both "true" criterion shifts between conditions and differences in the mean subjective confidence of the underlying distributions (as explained above), Prc is only sensitive to the former. That is, Prc is designed to "home in on" the particular subjective confidence level that determines whether or not a candidate response is reported. Assuming that participants assign scale values consistently between free- and forced-report and between experimental conditions, Prc will not be influenced by differences in reporting behavior due to confidence distribution shifts, only by changes to the subjective level of confidence associated with the criterion. Consequently, the cue type effect size may have been smaller for Prc than for B″D simply because Prc is sensitive to one influence whereas B″D is sensitive to two influences.

Some readers may consider it a problem that differences in B″D can be attributed to either criterion or distribution shifts. However, this will obviously depend on what information one is hoping to glean from the bias measure. Prc may be preferable to B″D if one is hoping to estimate the subjective confidence level at which the report criterion is set. But B″D may be preferable if one simply wants a summary statistic of the prevalence of a particular type of behavior to a given set of items (e.g., the tendency to respond "yes" in traditional SDT or the tendency to report candidates in the current treatment). Furthermore, it is not clear what the relationship is between Prc and the free-/forced-report retrieval discrepancy. Suppose, for example, that Prc increases from experimental condition A to B. If this were true of B″D, and monitoring did not vary, then the prediction is clear: the free/forced retrieval discrepancy should be greater for condition B than condition A. However, this is not necessarily true for Prc. For example, if the means of the underlying incorrect and correct candidate distributions also increase from condition A to B by approximately the same amount as Prc (again, keeping monitoring constant between the conditions), then the free-/forced-report retrieval discrepancy will be invariant between the conditions (i.e., the number of misses, cell c, will not vary). In short, it is the prevalence of reporting (B″D) that is important when generating report-bias-related predictions, not the confidence scale value that marks whether or not candidates will be reported (Prc).
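For concreteness, the sketch below shows one way γ-rc and the fit-ratio procedure for Prc might be computed. It follows my reading of the descriptions above; the variable names, the hypothetical counts and the tie-handling detail are my own, and the gamma for a 2X2 table is computed as Yule's Q, to which it reduces in that case.

```python
# gamma-rc: Goodman-Kruskal gamma computed from the 2x2 report/accuracy table
# (for a 2x2 table this reduces to Yule's Q).
def gamma_rc(a, b, c, d):
    # a = correct & reported, b = incorrect & reported,
    # c = correct & withheld, d = incorrect & withheld
    return (a * d - c * b) / (a * d + c * b)

# Prc via fit ratios: for each candidate criterion (scale values 2-6), the fit
# ratio is the weighted average of (1) the proportion of withheld candidates
# with confidence below the criterion and (2) the proportion of reported
# candidates with confidence at or above it. Ties are resolved by averaging.
def prc(reported_conf, withheld_conf):
    """reported_conf / withheld_conf: lists of 1-6 confidence ratings."""
    n = len(reported_conf) + len(withheld_conf)
    fits = {}
    for k in range(2, 7):
        correctly_withheld = sum(1 for conf in withheld_conf if conf < k)
        correctly_reported = sum(1 for conf in reported_conf if conf >= k)
        fits[k] = (correctly_withheld + correctly_reported) / n
    best = max(fits.values())
    ties = [k for k, fit in fits.items() if fit == best]
    return sum(ties) / len(ties)

print(gamma_rc(a=20, b=3, c=5, d=19))                                # hypothetical counts
print(prc(reported_conf=[5, 6, 4, 6, 5], withheld_conf=[1, 2, 2, 3, 1]))
```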
There are other reasons to use the SDT bias index instead of Prc: B″D is not subject to some problems that are inherent with measures such as Prc, where there is an attempt to map subjective confidence onto external scale values. Consider, for example, the phase group. Participants in this group assigned confidence ratings to initially withheld answers in a forced-report block at the end of the experiment, consisting wholly of the cues that were left blank in an earlier phase of the experiment. Most of these cues were likely to elicit mostly incorrect candidate answers in a limited range of (low) subjective confidence. This would likely lead participants to redistribute the 6 objective confidence scale values over this limited range of low subjective confidence, rather than strictly adhering to using only scale values below the report criterion. Should this occur, the meaning of Prc as a measure of the report criterion becomes questionable.

In summary, the SDT measure of monitoring (A'), the gamma correlation based on reporting behavior (γ-rc) and the gamma correlation based on confidence ratings used by Koriat and Goldsmith (1996a) (γ-conf) all indicated better monitoring in the weak cue condition than in the strong cue condition. Also, the SDT measure of report bias based on reporting behavior (B″D) and Prc, based on fit ratios weighted according to the number of reported and withheld responses, both indicated that bias was more liberal for weak cues than for strong cues. Consistent with both these findings, free-report retrieval underestimated forced-report retrieval more in the strong cue condition than in the weak cue condition. This underestimation of retrieval was sufficiently great for strong cues that when output was forced, they facilitated performance relative to no cues, a result that is at odds with the conclusions of Thomson and Tulving (1970).

General Discussion

In the present research, a model analogous to those seen in SDT was used to estimate retrieval, report bias and monitoring in Thomson and Tulving's (1970) classic cued-recall task. The effect of retrieval cues, when participants were given the option of omitting responses to cues, replicated the effect found in Thomson and Tulving, who also allowed omissions: reinstating the same cues at retrieval that were encoded specifically with the to-be-remembered material facilitated memory performance more than did nonreinstated cues. Indeed, under these reporting conditions, nonreinstated cues did not facilitate memory performance at all relative to no cue recall. However, because Thomson and Tulving did not force participants to provide answers to all the cues, their measure of memory performance was free-report retrieval as it is defined here. The SDT model assumes that free-report retrieval will underestimate forced-report retrieval to the extent that report bias is conservative and monitoring is poor. The results showed that both of these conditions were met for strong cues and, as expected, the free-forced retrieval difference was greater in the strong cue condition than in the weak cue condition. Only 28% of the correct answers potentially produced to strong cues were actually offered in free-report, compared to 92% for weak cues.

Consistent with the results obtained here, other researchers have found ways of improving performance to strong, nonreinstated cues using Thomson and Tulving's (1970) paradigm. However, some of these manipulations may also have been confounded with report bias differences, just as Thomson and Tulving's manipulation of cue type was.
For example, Santa and Lamwers (1974, Experiment 1) replicated a portion of Thomson and Tulving's methodology by having the critical lists in the no cue and strong cue conditions preceded by practice study-test lists. Each practice study list consisted of weak paired associates, and each practice test list consisted of reinstated weak cues and new, unrelated distractor cues. The final, critical study list of weak associates was followed by a recall phase with no cues or one with a mixture of nonreinstated, strong cues and unrelated distractor cues. They reasoned that participants performed poorly in the strong cue condition of Thomson and Tulving's experiments because the instructions failed to warn participants that correct responses on the final list were strong associates of the cues, not weak associates as in the previous lists. Without the warning, Santa and Lamwers argued, participants may continue to generate weak associates to the cues on the final critical list, and consequently respond incorrectly in the majority of cases. To alleviate the problem of the surprise switch from weak to strong associates in the final list for participants in the strong cue condition, Santa and Lamwers (1974) told participants in the warning group that the cues were not presented in the study list, but that some of them were strong associates of the target words. This small inclusion had a dramatic effect on performance. Replicating Thomson and Tulving's (1970) results, the mean number of words recalled for participants provided with strong cues and no warning was equivalent to performance in the no cue condition (Ms=5.33 and 5.40, respectively). However, for participants given strong cues and told that the correct responses were strong associates of the cues, performance jumped to 10.07 words recalled. Thus, Santa and Lamwers' results suggest that it is critical to warn participants of the nature of the association between the cues on the test list and the target words so that they can adjust their mental set for the final, critical list.

However, before accepting the mental set interpretation, it is worth noting that their warning manipulation was confounded with a difference in report bias. The mean number of responses to distractor cues was .33 in the strong cue/no warning group, which more than doubled to .80 in the strong cue/warning group. A shift in report bias was shown to be enough in the present experiment to boost performance to strong cues, without the need for any warning. Thus, it is conceivable that Santa and Lamwers' instruction was effective simply because it varied participants' report criterion, not because it changed their mental set.

There are two important points to be made about the reinterpretation of Thomson and Tulving's (1970) and Santa and Lamwers' (1974) results in terms of the SDT framework. First, it is clear that not all researchers are considering report bias and its effect on indices of recall performance in their experiments. For example, had Santa and Lamwers been studying recognition, the difference in bias would likely have been an important consideration because of potential distortions to indices like percent correct recognition. However, in the context of recall, report bias, and its potential to undermine even major conclusions of the research, is not even discussed.
Second, the reinterpretation shows that the present framework, in which report bias, retrieval and metamemory parameters are considered separately, is both feasible and potentially enlightening when applied to experimental data. The framework offers a synthesis of memory, metamemory and decision-making mechanisms that is much needed in the recall literature (see Roediger & Payne, 1985).

Appropriateness of Measures

Confidence ratings were used to derive γ-conf and Prc, the measures of monitoring and report bias, respectively, advocated by Koriat and Goldsmith (1996a). In contrast, two measures of monitoring (A' and γ-rc) and a measure of report bias (B″D) were dependent on participants' reporting behavior, that is, the distribution of responses across the four cells of the 2X2 contingency table shown as Table 1. With the current paradigm, measures based directly on reporting behavior are probably preferable to those based on confidence ratings because of the potential for inconsistency in the mapping between subjective confidence and the objective scale values between free- and forced-report.

With the methodology advocated here, confidence data were gathered when responses were offered, meaning that some confidence data were collected in forced-report and some in free-report. In contrast, Koriat and Goldsmith (1996a) gathered all confidence data from participants during a forced-report stage of the experiment and assumed that both the generated candidates and their associated confidence would be the same as at free-report. Their methodology is more likely to lead to consistent application of the scale values than is the current methodology because the confidence data were all collected during the same phase of the experiment. However, even with their methodology, they found that a few responses to the same test questions did not match between free- and forced-report, and these responses had to be eliminated from the analysis. Furthermore, even for candidates that were similar between free- and forced-report, there is no guarantee that the subjective confidence level in those candidates was the same. These cases are trickier to pinpoint because Koriat and Goldsmith did not have participants provide confidence at free-report. With the general knowledge questions that they used, the candidates generated and their associated confidence may be fairly stable. However, with other materials intended to test episodic memory, such as the paired associates used here, it is not clear how stable either candidate generation or confidence might be across time. For these reasons, I recommend using the current procedure for gathering candidate responses and confidence data, as long as confidence rating-based indices of performance are avoided.

But apart from the problems associated with using confidence data to derive performance indices, there are other reasons to prefer indices based on reporting behavior. What could be a more direct measure of participants' monitoring of the reportability of retrieved candidates than their decisions regarding whether or not to report them? Confidence ratings are peripheral to this issue; in fact, the SDT measures can, and perhaps should, be derived without considering confidence data at all.
Type 1 versus Type 2 Signal Detection Models

Some readers may recognize the similarity between the SDT model used in this research and early, so-called type 2 SDT models (e.g., Banks, 1970; Clarke, Birdsall & Tanner, 1959; Healy & Jones, 1973; Lockhart & Murdock, 1970; Murdock, 1966, 1974). The main differences between type 1 and type 2 SDT are the method by which the underlying item distributions are generated and the nature of the dimension over which items are distributed. In type 1 (stimulus contingent) SDT, the experimenter generates the items by choosing them for the recognition experiment, and the items are assumed to be distributed over activation, strength or familiarity. On the other hand, in type 2 (response contingent) SDT, the items are generated by participants when cues initiate a search of long-term memory. Furthermore, generated items are assumed to be distributed over confidence, not strength.

Although application of type 1 SDT has been widespread in research on recognition memory, the use of type 2 SDT with recall data has been limited. This may be partly attributable to the fact that there has been some confusion in the literature regarding the distinction between type 1 and type 2 SDT, which led to some inappropriate criticism of type 2 SDT (see discussion and examples of this confusion in Healy & Jones, 1973). For example, if it is erroneously assumed that the underlying dimension over which items are distributed is strength for both type 1 and type 2 SDT, then discrimination reflects differences in strength between old and new items. Thus, under this assumption, manipulations thought to enhance the strength of old items should also enhance type 2 discrimination. However, because discrimination more accurately represents a metacognitive index with type 2 SDT, manipulations that enhance memory for items (retrieval) should not (and do not) necessarily enhance discrimination (monitoring).

To illustrate, consider an experiment by Murdock (1966). He presented participants with short lists of unrelated paired associates. Following the list, one of the cues was presented and participants were required to provide the associated target, guessing if necessary, and give a confidence rating. He found a large effect of serial position on retrieval; the proportion of targets correctly provided in forced response to cues was .29, .29, .44, .59 and .85 for serial positions 1-5, respectively. However, the effect of serial position on discrimination (as derived from confidence ratings) was negligible. That is, to use the current terminology, serial position affected forced-report retrieval but not monitoring. Murdock's (1966) data are at odds with a strength interpretation of discrimination. In his experiment, more recent items were more retrievable, so they should also have had more strength (higher d'). If, instead, the type 2 discrimination index is more appropriately interpreted as a metacognitive index of monitoring rather than strength, then Murdock's results are understandable; serial position affected participants' ability to retrieve correct candidate responses, but did not affect the relationship between assigned confidence and accuracy.
Although there may be situations where items are both more retrievable at forced-report and monitored better, rendering a strength interpretation of type 2 discrimination plausible, the present research, and that of Koriat and Goldsmith (1996a), Murdock and others, suggests that the two memory parameters are dissociable. As Lockhart and Murdock (1970) have stated, "it is a gross oversimplification to say that the type 2 d' is a measure of ... strength" (p. 107).

A second reason that type 2 SDT may not have been adopted in the past may relate to the different ways that the item distributions are generated. Because participants generate their own candidates, with the best candidates being distributed across confidence, it is not clear what the nature of the underlying distributions might be like. For example, monitoring may be very high for some materials, meaning that participants' responses in free-report are all correct and assigned very high confidence, whereas those in forced-report are all incorrect and assigned very low confidence. Such materials would produce highly skewed distributions, peaking at the extremes of the confidence dimension. The distributions associated with these materials would render parametric indices such as d' and β unsuitable because they assume normal underlying distributions of equal variance. However, this is more of a problem with using d' and β than it is a criticism of type 2 SDT more generally. To circumvent assumptions regarding the nature of the underlying distributions in the present research, I used a nonparametric index of discrimination (A', which represents the area under the Receiver Operating Characteristic [ROC] curve) and its associated measure of bias (B″D; Donaldson, 1992). Until more is known about the nature of the underlying distributions, nonparametric statistics are probably the safer bet.

With the growing interest in metacognitive processes over the last few decades and the need for analytical tools to estimate and control report bias in cued testing paradigms, the value of type 2 SDT models may be more readily accepted today. Certainly there is no need to dismiss them out of hand simply because their interpretation is inconsistent with more traditional type 1 SDT models. As Healy and Jones (1973) have stated, "...type II analysis can illuminate the important evaluation processes which go on in recall and which may affect the measure of memory loss in such experiments" (p. 340).

The "Lucky Guessing" Criticism and the Status of Encoding Specificity

If memory researchers were surveyed and asked to list the most important principles of memory, it is likely that the encoding specificity principle would come out on top. The importance of reinstating at test the same retrieval cues that were encoded specifically with the to-be-remembered material during study has been demonstrated in many different domains with a wide variety of materials. Certainly, I am not arguing that the current experiment undermines this general principle of memory. However, I am arguing that there are flaws in both the design and interpretation of Thomson and Tulving's (1970) original experiments using strong and weak cues that generated interest in the encoding specificity principle in the first place.
The “Lucky Guessing” Criticism and the Status of Encoding Specificity

If memory researchers were surveyed and asked to list the most important principles of memory, the encoding specificity principle would likely come out on top. The importance of reinstating at test the same retrieval cues that were specifically encoded with the to-be-remembered material during study has been demonstrated in many different domains with a wide variety of materials. Certainly, I am not arguing that the current experiment undermines this general principle of memory. However, I am arguing that there are flaws in both the design and the interpretation of Thomson and Tulving’s (1970) original experiments using strong and weak cues, the experiments that generated interest in the encoding specificity principle in the first place.

One obvious criticism of using forced-report retrieval is that participants in the forced-report condition may have responded correctly to several cues because of memory for pre-experimentally derived associations (PEDAs); that is, a given candidate may simply have been the first word related to the cue that “came to mind,” and would have been produced even by participants who had not been given any study list. This “contamination” of the retrieval measure with PEDAs in the forced-report condition is likely to have been particularly great with strong cues because the normative probability of producing a target word was, by definition, higher than with weak cues. A critic might argue, therefore, that Thomson and Tulving’s (1970) choice of free-report retrieval is preferable to forced-report retrieval because participants are not affected by PEDAs in free report, and PEDAs only serve to confound the strong/weak cue comparison.

Before accepting this argument, however, it is important to consider its assumptions. First, it assumes that participants can monitor more or less perfectly any retrieved information that derives from a pre-experimental source; that is, that participants withhold all candidates retrieved because of PEDAs in the strong-cue/free-report condition. Tulving and Thomson (1973) explicitly supported this assumption by suggesting that “guessing from semantic memory” occurred rarely if free-report retrieval was used as the memory index (e.g., see discussion on p. 356). However, the argument also assumes that participants more or less perfectly monitor the retrieval of episodic content derived from the study list; that is, that participants withhold none of the experimentally derived candidates that they retrieve in the strong-cue/free-report condition. Thomson and Tulving implicitly supported this assumption by treating free-report omissions as incorrect responses. By counting omissions as errors, the assumption must have been that, had responses been gathered to cues initially left blank, they would all have been wrong: either the wrong word would have been produced, or the right word would have been produced but would not really have been right because it was a “lucky guess” (i.e., not “true” episodic retrieval).

This latter assumption is particularly questionable. First, the extra search time and effort induced by forcing participants to respond to strong cues is likely to have produced at least some new experimentally derived candidate memories. This is particularly likely given the large number of correct answers that were withheld to strong cues in free report in the current research (i.e., poor monitoring and a conservative report bias). Additionally, Bahrick (1969, 1970) found that nonreinstated cues consistently elicited target words in forced report, across a wide range of associative strengths, with greater probability if the target words had previously been studied in a paired-associate list than if no paired-associate list had been studied. That is, the probability of target words being elicited by nonreinstated cues, after participants had studied a paired-associate list containing the targets, was higher than the normative probability of the targets being generated by the same cues using PEDAs. Bahrick’s results suggest that nonreinstated cues that are strong associates of the target words elicit memory of targets derived from encoding the experimental study list, not just from PEDAs.
It is necessary to point out, however, that this “lucky guessing” criticism speaks only to whether my results undermine the encoding specificity principle, which was never the intended purpose of the research. My research was conducted to demonstrate the problems that can arise from failing to consider issues such as monitoring and report bias when free-report indices of retrieval are used, and to use Thomson and Tulving’s (1970) original research as an example. At the very least, their use of free report carries with it some very strong, unwarranted assumptions regarding the nature of withheld information, assumptions that could only have been supported had output been forced. Thus, Thomson and Tulving’s experiments, by themselves, provide at best weak evidence for the encoding specificity principle, despite the fact that they are considered classics and are cited in textbooks throughout the world as providing its very foundation. This remains true regardless of whether the correct withheld candidates in free report were “lucky guesses” or reflected “true” episodic retrieval.

Today, various analytical tools are available to cognitive psychologists (e.g., the remember/know paradigm: Higham, 1998; Tulving, 1985; the independent scales technique: Higham & Vokey, 2001; the process dissociation paradigm: Jacoby, 1991; and the opposition paradigm: Higham, Vokey & Pritchard, 2000; Jacoby, Woloshyn & Kelley, 1989) that allow different sources of retrieval to be separated. Currently, we are investigating the nature of the reported and withheld information in the strong/weak cue paradigm using some of these tools. Hopefully, this line of research will allow us to determine the nature of the correct withheld information and to assess more directly the role of encoding specificity in this paradigm.

References

Bahrick, H.P. (1969). Measurement of memory by prompted recall. Journal of Experimental Psychology, 79, 213-219.
Bahrick, H.P. (1970). Two-phase model for prompted recall. Psychological Review, 77, 215-222.
Banks, W.P. (1970). Signal detection theory and human memory. Psychological Bulletin, 74, 81-99.
Clarke, F.R., Birdsall, T.G., & Tanner, W.P. (1959). Two types of ROC curves and definitions of parameters. Journal of the Acoustical Society of America, 31, 629-630.
Donaldson, W. (1992). Measuring recognition memory. Journal of Experimental Psychology: General, 121, 275-277.
Dywan, J., & Bowers, K. (1983). The use of hypnosis to enhance recall. Science, 222, 184-185.
Fisher, R.P. (1996). Misconceptions in design and analysis of research with the cognitive interview. Psycoloquy, 7(35), witness-memory.12.fisher.
Fisher, R.P., & Geiselman, R.E. (1992). Memory enhancing techniques for investigative interviewing: The cognitive interview. Springfield, IL: Thomas.
Grier, J.B. (1971). Nonparametric indexes for sensitivity and bias: Computing formulas. Psychological Bulletin, 75, 424-429.
Healy, A.F., & Jones, C. (1973). Criterion shifts in recall. Psychological Bulletin, 79, 335-340.
Higham, P.A. (1998). Believing details known to have been suggested. British Journal of Psychology, 89, 265-283.
Higham, P.A., & Roberts, W.T. (1996). Measuring recall performance. Psycoloquy, 7(38), witness-memory.13.higham.
Higham, P.A., & Vokey, J.R. (2001). Illusory recollection. Manuscript submitted for publication.
Higham, P.A., Vokey, J.R., & Pritchard, J.L. (2000). Beyond task dissociations: Evidence for controlled and automatic influences in artificial grammar learning. Journal of Experimental Psychology: General, 129, 457-470.
Jacoby, L.L. (1991). A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory & Language, 30, 513-541.
Jacoby, L.L., Woloshyn, V., & Kelley, C. (1989). Becoming famous without being recognized: Unconscious influences of memory produced by dividing attention. Journal of Experimental Psychology: General, 118, 115-125.
Klatzky, R.L., & Erdelyi, M.H. (1985). The response criterion problem in tests of hypnosis and memory. The International Journal of Clinical & Experimental Hypnosis, 33, 246-257.
Koriat, A., & Goldsmith, M. (1994). Memory in naturalistic and laboratory contexts: Distinguishing the accuracy-oriented and quantity-oriented approaches to memory assessment. Journal of Experimental Psychology: General, 123, 297-315.
Koriat, A., & Goldsmith, M. (1996a). Monitoring and control processes in the strategic regulation of memory accuracy. Psychological Review, 103, 490-517.
Koriat, A., & Goldsmith, M. (1996b). Memory metaphors and the real-life/laboratory controversy: Correspondence versus storehouse conceptions of memory. Behavioral & Brain Sciences, 19, 167-228.
Koriat, A., & Goldsmith, M. (1996c). Memory as something that can be counted versus memory as something that can be counted on. In D. Herrmann, C. McEvoy, C. Hertzog, P. Hertel, & M. Johnson (Eds.), Basic and applied memory research: Practical applications, Vol. 2 (pp. 3-18). Hillsdale, NJ: Erlbaum.
Koriat, A., Goldsmith, M., & Pansky, A. (2000). Toward a psychology of memory accuracy. Annual Review of Psychology, 51, 481-537.
Lockhart, R.S., & Murdock, B.B. (1970). Memory and the theory of signal detection. Psychological Bulletin, 74, 100-109.
Memon, A., & Higham, P.A. (1999). A review of the cognitive interview. Psychology, Crime & Law, 5, 177-196.
Memon, A., & Stevenage, S.V. (1996). Interviewing witnesses: What works and what doesn't? Psycoloquy, 7(6), witness-memory.1.memon.
Murdock, B.B. (1966). The criterion problem in short-term memory. Journal of Experimental Psychology, 72, 317-324.
Murdock, B.B. (1974). Human memory: Theory and data. New York: Wiley.
Nelson, T.O. (1984). A comparison of current measures of the accuracy of feeling-of-knowing predictions. Psychological Bulletin, 95, 109-133.
Nelson, T.O. (1996). Gamma is a measure of the accuracy of predicting performance on one item relative to another item, not of the absolute performance on an individual item: Comments on Schraw (1995). Applied Cognitive Psychology, 10, 257-260.
Roberts, W.T., & Higham, P.A. (2001). Selecting accurate statements from the cognitive interview using confidence ratings. Manuscript submitted for publication.
Roediger, H.L., & Payne, D.G. (1985). Recall criterion does not affect recall level or hypermnesia: A puzzle for generate/recognize theories. Memory & Cognition, 13, 1-7.
Santa, J.L., & Lamwers, L.L. (1974). Encoding specificity: Fact or artifact? Journal of Verbal Learning & Verbal Behavior, 13, 412-423.
Schraw, G. (1995). Measures of feeling-of-knowing accuracy: A new look at an old problem. Applied Cognitive Psychology, 9, 321-332.
Snodgrass, J.G., & Corwin, J. (1988). Pragmatics of measuring recognition memory: Applications to dementia and amnesia. Journal of Experimental Psychology: General, 117, 34-50.
Thomson, D.M., & Tulving, E. (1970). Associative encoding and retrieval: Weak and strong cues. Journal of Experimental Psychology, 86, 255-262.
Tulving, E. (1985). Memory and consciousness. Canadian Psychology, 26, 1-12.
Tulving, E., & Thomson, D.M. (1973). Encoding specificity and retrieval processes in episodic memory. Psychological Review, 80, 352-373.
Wright, D.B. (1996). Measuring feeling of knowing: Comment on Schraw (1995). Applied Cognitive Psychology, 10, 261-268.

Appendix A

The 100 target words, 100 strong associates, 100 weak associates, and the probabilities of target production for each weak and strong associate.

Target Word  Strong Associate  Weak Associate  P(target/strong)  P(target/weak)
loft  attic  pigeon  0.15  0.01
secret  agent  societies  0.19  0.01
plane  air  elevation  0.1  0.02
clock  alarm  antique  0.34  0.01
ship  anchor  nelson  0.28  0.01
money  bill  dealer  0.22  0.01
knife  blade  arrow  0.28  0.01
James  bond  Walter  0.25  0.01
read  book  carrots  0.18  0.01
beer  bottle  tanks  0.17  0.01
brigade  charge  postal  0.06  0.01
cheddar  cheese  scotch  0.12  0.01
hammer  chisel  tilt  0.41  0.01
tea  coffee  cabby  0.39  0.01
hot  cold  footed  0.34  0.02
Russia  communism  tractor  0.2  0.01
state  condition  union  0.19  0.01
goods  consumer  textile  0.28  0.01
food  cook  warmer  0.31  0.01
law  criminal  family  0.16  0.01
hill  dale  chalk  0.26  0.02
today  date  topical  0.14  0.01
owe  debt  vow  0.18  0.01
sea  deep  nauseous  0.18  0.01
life  death  equation  0.19  0.01
fragile  delicate  spindly  0.11  0.01
same  equal  parrot  0.15  0.01
continent  Europe  metric  0.15  0.01
precise  exact  tidiness  0.33  0.01
prison  cell  quilt  0.42  0.01
necessary  essential  unable  0.33  0.01
everlasting  eternal  procession  0.15  0.01
blue  eyes  bells  0.2  0.01
fiction  fact  bible  0.42  0.01
true  false  hollow  0.32  0.01
down  fallen  abdicate  0.11  0.01
last  first  fifth  0.51  0.01
blood  donor  bats  0.88  0.01
leader  follow  wisdom  0.16  0.01
kind  generous  various  0.31  0.01
gap  generation  void  0.2  0.01
lady  gentleman  seventy  0.38  0.01
hand  glove  torch  0.66  0.01
white  black  pastry  0.58  0.01
transplant  heart  surgery  0.18  0.01
neck  giraffe  snob  0.34  0.01
enemy  foe  watchtower  0.42  0.01
light  neon  wick  0.6  0.01
soft  hard  option  0.44  0.01
low  high  wages  0.6  0.01
glory  splendor  wings  0.12  0.01
home  address  farms  0.38  0.01
thought  idea  impulse  0.23  0.01
guilty  verdict  wary  0.33  0.01
wrong  incorrect  cane  0.62  0.01
god  almighty  Venus  0.8  0.01
knowledge  knowing  unsure  0.16  0.01
steel  girders  submarine  0.33  0.01
bird  nest  gibbon  0.5  0.01
diamond  jewel  rare  0.17  0.01
paper  journal  darts  0.34  0.01
murder  homicide  villain  0.35  0.01
queen  royalty  visit  0.51  0.01
door  hinge  van  0.67  0.01
tie  knot  weld  0.34  0.01
sheep  flock  valley  0.76  0.01
found  lost  jungle  0.61  0.01
rung  ladder  outer  0.19  0.01
woman  man  ailing  0.67  0.01
cow  moo  dumb  0.71  0.01
early  late  unready  0.35  0.01
small  large  imply  0.45  0.01
least  minimum  total  0.14  0.01
mistress  master  sir  0.26  0.01
old  new  earl  0.64  0.01
day  night  feast  0.53  0.01
one  unity  track  0.33  0.01
tree  oak  napalm  0.67  0.01
subject  predicate  undergo  0.27  0.01
duty  obligation  unfit  0.13  0.01
even  odd  were  0.27  0.01
present  past  brain  0.41  0.01
fear  dread  flutter  0.63  0.01
fail  pass  list  0.21  0.01
sentence  phrase  verb  0.24  0.01
explosion  population  stress  0.32  0.01
bed  mattress  crimson  0.61  0.01
meat  butcher  keep  0.48  0.01
back  front  weeding  0.58  0.01
army  salvation  serge  0.73  0.01
plan  scheme  systematic  0.45  0.01
electric  shock  scope  0.21  0.01
wild  tame  eerie  0.24  0.01
nervous  tense  unrest  0.19  0.01
you  thank  lad  0.74  0.01
over  under  leper  0.27  0.01
bottom  top  fidget  0.51  0.01
handy  useful  umbrella  0.15  0.01
cleaner  vacuum  soot  0.43  0.01
rubbish  dump  poetry  0.45  0.01

Author Notes

Philip A. Higham, Department of Psychology, University of Southampton. Preparation of this article was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) research grant to the author.
Portions of this research were presented at the 9th annual meeting of the Canadian Society for Brain, Behaviour and Cognitive Science, June 1999, Edmonton, Alberta, Canada, and at the 40th annual meeting of the Psychonomic Society, November 1999, Los Angeles, California, USA. I thank Casey Hrabchuk for helpful discussion and research assistance and Peter Winters for additional research assistance. I also thank John Dunlosky, Asher Koriat, and Morris Goldsmith for helpful comments on earlier drafts of this article. Correspondence concerning this article should be addressed to Philip A. Higham, Department of Psychology, University of Southampton, Highfield, Southampton, U.K., SO17 1BJ. E-mail: [email protected]

Footnotes

1. At the time the materials were being prepared, the Edinburgh Associative Thesaurus was available on-line at http://www.cis.rl.ac.uk/proj/psych/eat.html. However, a more recent check indicated that the thesaurus was no longer available. For the reader's convenience, the target words, weak associates, and strong associates are listed in Appendix A.
2. The point system was dropped at forced report to keep participants motivated during the task. I suspected that if participants chose to withhold a response to avoid a penalty, but were then forced to respond to the cue and be penalized anyway, they would find the task pointless and stop trying to perform well. Incentive was also dropped in forced report in Koriat and Goldsmith's (1996a) design.
3. A response was considered correct in the cued-recall (trial and phase) groups only if the associated target was given in response to either the strong or the weak cue. Targets associated with other cues were not counted as correct. However, because no cues were presented in the no-cue group, this constraint could not be applied; instead, any target produced anywhere in the free-recall list was counted as a correct response, given that it was not repeated. Thus, retrieval in the no-cue group was defined as the number of correctly recalled targets divided by 100 (the number of possible retrievable targets).
4. Scale value "1" was not included in the analysis because all (1.0) reported items are assigned confidence "1" or higher, and none (0.0) of the withheld items are assigned a confidence value less than "1," meaning that this scale value always had a fit ratio of .50.

Table 1
A 2 × 2 Contingency Table and Formulae Used to Derive the Various Measures Discussed in the Text

                         Candidate Answer
Response          Correct        Incorrect
Reported             a               b
Withheld             c               d

Note. Free-report retrieval = a/(a+b+c+d); forced-report retrieval = (a+c)/(a+b+c+d); hit rate (h) = a/(a+c); false-alarm rate (fa) = b/(b+d); monitoring = A′ = .5 + [(h - fa)(1 + h - fa)]/[4h(1 - fa)]; report bias = B″D = [(1 - h)(1 - fa) - h*fa]/[(1 - h)(1 - fa) + h*fa].

Table 2
The a, b, c, and d Cells of the 2 × 2 Contingency Tables for Strong and Weak Retrieval Cues as a Function of Experimental Group

                               Candidate Answer
Cue Type and Response       Correct      Incorrect
Phase Group
  Weak Cues
    Reported                  13.31         7.44
    Withheld                    .94        25.19
  Strong Cues
    Reported                   3.38         6.94
    Withheld                   9.13        27.69
Trial Group
  Weak Cues
    Reported                  10.94         9.88
    Withheld                   1.06        24.56
  Strong Cues
    Reported                   3.44         8.63
    Withheld                   8.31        27.19

Figure Captions

Figure 1. Free- and forced-report retrieval in the weak (reinstated) and strong (nonreinstated) cue conditions. Retrieval in the no-cue condition is shown as the dotted line.
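As a worked illustration, the Table 1 formulas can be applied to the Phase-group cell means in Table 2. The sketch below again assumes Python and reuses the report_measures function sketched earlier; neither is part of the original article, and because the inputs are group means rather than raw frequencies, the resulting values are approximations and need not match the statistics reported in the text.

```python
# Phase-group cell means from Table 2 (a = correct/reported, b = incorrect/reported,
# c = correct/withheld, d = incorrect/withheld)
weak = report_measures(a=13.31, b=7.44, c=0.94, d=25.19)   # weak (reinstated) cues
strong = report_measures(a=3.38, b=6.94, c=9.13, d=27.69)  # strong (nonreinstated) cues

for label, (free, forced, a_prime, bias) in (("weak", weak), ("strong", strong)):
    print(f"{label} cues: free={free:.2f}, forced={forced:.2f}, "
          f"A'={a_prime:.2f}, B\"D={bias:.2f}")
```

On these cell means, the strong-cue condition shows a much larger gap between free-report and forced-report retrieval than the weak-cue condition, together with weaker monitoring and a more conservative report bias, which is broadly the pattern described in the text.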